[Common] Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK by phu0ngng · Pull Request #3150 · NVIDIA/TransformerEngine

phu0ngng · 2026-06-26T11:05:14Z

Description

Update NCCL submodule to have the fix for MAX_SUPPORTED_TOKENS_PER_RANK

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-06-26T11:07:06Z

Greptile Summary

This PR bumps the 3rdparty/nccl submodule to pick up a downstream fix for MAX_SUPPORTED_TOKENS_PER_RANK, and updates four EP test/bench launcher scripts to use a consistent NVLink detection method.

Submodule bump (808d2433 → a6b5de08): pulls in the NCCL fix for MAX_SUPPORTED_TOKENS_PER_RANK that affects Expert Parallelism collectives.
NVLink detection standardized: all four EP scripts now use nvidia-smi nvlink --status 2>/dev/null | grep -qE 'Link [0-9]+:.*GB/s' instead of the topology-matrix check (nvidia-smi topo -m), confirming links are active (showing real bandwidth) rather than merely present in the topology table.
Three scripts (cpp, jax test, jax bench) gain the NVLink guard for the first time; the PyTorch script replaces the old topology-based guard with the new one.
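The difference between a link that is merely listed and one that is active can be sketched as follows. This is a minimal sketch, not the code in the four scripts: the helper name `has_active_nvlink` and the sample status text are assumptions for illustration, and only the `grep` pattern is taken from the summary above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvidia-smi nvlink --status` text on stdin and
# succeed only if at least one link line reports a bandwidth in GB/s.
has_active_nvlink() {
    grep -qE 'Link [0-9]+:.*GB/s'
}

# Illustrative status text (assumed, not captured from a real machine).
active_sample='GPU 0: Example GPU
  Link 0: 25 GB/s
  Link 1: 25 GB/s'
inactive_sample='GPU 0: Example GPU
  Link 0: <inactive>'

if printf '%s\n' "$active_sample" | has_active_nvlink; then
    echo "active sample: NVLink active"
fi
if ! printf '%s\n' "$inactive_sample" | has_active_nvlink; then
    echo "inactive sample: no active NVLink"
fi
```

On a real node the same helper would be fed by `nvidia-smi nvlink --status 2>/dev/null`; when the command is missing or prints nothing, the pattern does not match and the guard treats the node as having no active NVLink.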

Confidence Score: 5/5

Safe to merge — the submodule bump is a targeted bug fix and the shell script changes only tighten the existing skip guards.

The only code changes are a submodule pointer update and consistent NVLink detection guards across four launcher scripts. The new detection method (checking for active link bandwidth via nvidia-smi nvlink --status) is strictly more precise than the old topology-matrix check, and the worst failure mode is an over-skip on an unusual hardware configuration rather than a hang or data corruption. No TE library code is modified.

No files require special attention.

Important Files Changed

Filename	Overview
3rdparty/nccl	Submodule pointer updated to a6b5de08 to include the MAX_SUPPORTED_TOKENS_PER_RANK fix; no TE-side code changes required.
tests/pytorch/distributed/run_test_ep.sh	Replaced nvidia-smi topo -m topology check with nvidia-smi nvlink --status active-link check; logic and skip semantics are preserved.
tests/cpp_distributed/run_test_ep.sh	Adds NVLink active-link guard (new for this script) using the standardized nvidia-smi nvlink --status pattern.
tests/jax/multi_process_launch_ep.sh	Adds NVLink active-link guard (new for this script) placed correctly after the GPU-count check.
examples/jax/ep/bench/run_ep_bench.sh	Adds NVLink active-link guard (new for this script) placed correctly after the GPU-count check.
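
To confirm locally that a checkout records the bumped submodule pointer, one can compare the commit recorded for 3rdparty/nccl with the abbreviated hash from the table. This is a minimal sketch, assuming it is run from the repository root; `sha_matches` is a hypothetical helper, and because this page gives only the abbreviated hash, only that prefix is compared.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if commit hash $1 starts with prefix $2.
sha_matches() {
    case "$1" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

expected="a6b5de08"   # abbreviated hash from the PR summary

# Commit that the superproject records for the submodule path.
# Empty when run outside the repository or when the path is not tracked.
recorded="$(git rev-parse HEAD:3rdparty/nccl 2>/dev/null || true)"

if sha_matches "$recorded" "$expected"; then
    echo "3rdparty/nccl pointer matches $expected"
else
    echo "3rdparty/nccl pointer does not match $expected"
fi
```

If the pointer matches but the submodule working tree is stale, `git submodule update --init 3rdparty/nccl` checks out the recorded commit.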

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[EP test/bench script starts] --> B{GPU count >= 4?}
    B -- No --> C[SKIP: not enough GPUs]
    B -- Yes --> D{nvidia-smi nvlink --status\nmatches 'Link N:.*GB/s'?}
    D -- No --> E[SKIP: NVLink not active\nPCIe-only or unsupported]
    D -- Yes --> F[Run EP test / bench]
    F --> G{Exit code == 0?}
    G -- Yes --> H[PASS]
    G -- No --> I[FAIL]
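
The order of the two guards in the flowchart can be sketched as a small launcher skeleton. This is a sketch under stated assumptions, not the contents of the four scripts: the function name `ep_guard_decision`, the use of `nvidia-smi -L` to count GPUs, and the messages are illustrative, while the minimum of 4 GPUs and the `grep` pattern come from the summary above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print SKIP or RUN for a GPU count ($1) and
# `nvidia-smi nvlink --status` text ($2), in the order the flowchart shows.
ep_guard_decision() {
    local num_gpus="$1" nvlink_status="$2"
    if [ "$num_gpus" -lt 4 ]; then
        echo "SKIP: not enough GPUs"
    elif ! printf '%s\n' "$nvlink_status" | grep -qE 'Link [0-9]+:.*GB/s'; then
        echo "SKIP: NVLink not active"
    else
        echo "RUN"
    fi
}

# Gather inputs; both fall back to "nothing found" when nvidia-smi is absent.
num_gpus="$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ' || true)"
nvlink_status="$(nvidia-smi nvlink --status 2>/dev/null || true)"

decision="$(ep_guard_decision "$num_gpus" "$nvlink_status")"
echo "$decision"
# A real launcher would start the EP test or bench here when the decision
# is RUN and report PASS or FAIL from that command's exit code.
```

Keeping the decision in a function that takes plain text makes the guard easy to exercise on machines without GPUs, which is where the skip paths matter most.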

Reviews (6): Last reviewed commit: "Detect active NVLink via nvlink --status..."

phu0ngng · 2026-06-26T12:19:17Z

/te-ci L1

phu0ngng · 2026-06-27T09:41:57Z

/te-ci L1

phu0ngng · 2026-06-29T06:18:13Z

/te-ci L1

phu0ngng · 2026-06-29T13:40:15Z

/te-ci L1

jberchtold-nvidia

LGTM, thanks!

phu0ngng · 2026-06-29T16:41:20Z

Pipeline #56213025

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng · 2026-06-29T19:49:25Z

/te-ci L1

…NS_PER_RANK (#3150)

* nccl with relax num_dispatch_tokens%64!=0
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Skip EP tests/examples on nodes without NVLink
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

greptile-apps Bot reviewed Jun 26, 2026

Comment thread 3rdparty/nccl Outdated

phu0ngng added the 2.17 label Jun 26, 2026

phu0ngng requested a review from jberchtold-nvidia June 27, 2026 09:18

jberchtold-nvidia previously approved these changes Jun 29, 2026

phu0ngng added 5 commits June 29, 2026 11:52

update nccl

f1618df

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

nccl with relax num_dispatch_tokens%64!=0

a810ad6

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Skip EP tests/examples on nodes without NVLink

57a6013

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

cleanup

8c803f6

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Detect active NVLink via nvlink --status link bandwidth in EP scripts

d4055aa

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

phu0ngng dismissed jberchtold-nvidia’s stale review via d4055aa June 29, 2026 19:48

phu0ngng force-pushed the update_nccl branch from 22da482 to d4055aa Compare June 29, 2026 19:48

phu0ngng requested a review from jberchtold-nvidia June 29, 2026 19:48

jberchtold-nvidia approved these changes Jun 29, 2026

phu0ngng merged commit 90baf02 into NVIDIA:main Jun 30, 2026
46 of 54 checks passed

phu0ngng deleted the update_nccl branch June 30, 2026 07:37
